Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin, Muchao Ye
Vision-Language (VL) pre-trained models have shown strong performance on many multimodal tasks. However, the adversarial robustness of such models has not been fully explored. Existing approaches mainly study adversarial robustness under the white-box setting, which is unrealistic in practice. In this paper, we investigate a new yet practical task: crafting image and text perturbations using pre-trained VL models to attack black-box fine-tuned models on different downstream tasks.
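The abstract describes the transfer setting only at a high level; the sketch below illustrates its simplest form: run PGD against a white-box pre-trained surrogate's image encoder to push the adversarial embedding away from the clean one, then hand the perturbed image to the black-box fine-tuned victim. This is a generic embedding-space transfer attack, not the paper's actual method; `surrogate.encode_image` is an assumed CLIP-style API, and `eps`, `alpha`, and `steps` are conventional PGD hyperparameters.

```python
# Hedged sketch: generic embedding-space PGD on a white-box surrogate,
# transferred to a black-box victim. Not the paper's attack; the
# encode_image API and all hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def transfer_image_attack(surrogate, image, eps=8/255, alpha=2/255, steps=10):
    """Craft a perturbation that drives the surrogate's embedding of the
    image away from its clean embedding, under an L-infinity budget."""
    with torch.no_grad():
        clean_emb = surrogate.encode_image(image)  # assumed CLIP-style API
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        adv_emb = surrogate.encode_image(image + delta)
        sim = F.cosine_similarity(adv_emb, clean_emb, dim=-1).mean()
        grad = torch.autograd.grad(sim, delta)[0]
        with torch.no_grad():
            delta -= alpha * grad.sign()   # step to *reduce* similarity
            delta.clamp_(-eps, eps)        # stay inside the L-inf ball
    return (image + delta).clamp(0, 1).detach()

# The adversarial image is then sent to the fine-tuned victim model,
# queried as a black box (outputs only, no gradients). A text
# perturbation would be crafted analogously on the language side.
```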
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Vision and Language (VL) models offer an effective method for aligning the representation spaces of images and text, enabling numerous applications such as cross-modal retrieval, visual and multi-hop question answering, captioning, and many more. However, the aligned image-text spaces learned by all the popular VL models still suffer from the so-called 'object bias': their representations behave as 'bags of nouns', mostly ignoring or downplaying the attributes, relations, and states of objects described in texts or appearing in images. Although some notable attempts at fixing these 'compositional reasoning' issues have been proposed in the recent literature, the problem is still far from solved. In this paper, we uncover two factors limiting VL models' compositional reasoning performance. Both are properties of the paired VL dataset used for fine-tuning (or pre-training) the VL model: (i) caption quality, or in other words the 'image-alignment' of the texts; and (ii) caption 'density', in the sense of mentioning all the details appearing in the image. We propose a fine-tuning approach that automatically treats these factors on a standard collection of paired VL data (CC3M). Applied to CLIP, it yields a significant compositional reasoning improvement of up to $\sim27\%$ over the base model, up to $\sim20\%$ over the strongest baseline, and $6.7\%$ on average. Our code is provided in the Supplementary and will be released upon acceptance.
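Since the two dataset properties are the core of the argument, here is a minimal sketch of how one might measure them on a paired dataset such as CC3M, using CLIP image-text similarity as the 'image-alignment' proxy and caption length as a crude 'density' proxy. The thresholds and the filter-only strategy are illustrative assumptions; the paper's approach improves captions rather than merely filtering them.

```python
# Hedged sketch: score caption quality (image-alignment) with CLIP and
# approximate density by caption length. Thresholds are assumptions.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def alignment_score(image_path: str, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([caption], truncate=True).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(image)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).item()

def keep(pair, align_thresh=0.28, min_words=12):
    """Keep pairs that are both well-aligned and reasonably dense."""
    score = alignment_score(pair["image"], pair["caption"])
    return score >= align_thresh and len(pair["caption"].split()) >= min_words
```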
Localized Symbolic Knowledge Distillation for Visual Commonsense Models
Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to "point to" and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build a Localized Visual Commonsense model which allows users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. This pipeline is scalable and fully automatic, as no aligned or human-authored image and text pairs are required. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus expanded solely from images can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in zero-shot settings demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression.
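The distillation pipeline lends itself to a compact sketch: verbalize the image globally and per region with off-the-shelf VL models, prompt an LLM for region-grounded commonsense, and keep only what the critic accepts. All callables, the prompt wording, and the acceptance threshold below are hypothetical placeholders, not the paper's implementation.

```python
# Hedged sketch of the localized knowledge distillation pipeline.
# Every callable here (captioner, region_proposer, region_describer,
# llm, critic) is an injected placeholder, not a real library API.
def build_localized_corpus(images, captioner, region_proposer,
                           region_describer, llm, critic, thresh=0.5):
    corpus = []
    for img in images:
        global_desc = captioner(img)  # literal whole-image description
        for i, region in enumerate(region_proposer(img)):
            local_desc = region_describer(img, region)
            prompt = (
                f"Image: {global_desc}\n"
                f"Region {i}: {local_desc}\n"
                f"State one commonsense inference about region {i}:"
            )
            statement = llm(prompt)
            # The critic filters low-quality samples; no human-authored
            # image-text pairs are used anywhere in the pipeline.
            if critic(img, region, statement) >= thresh:
                corpus.append({"image": img, "region": region,
                               "knowledge": statement})
    return corpus  # training data for the region-as-input VL model
```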